HIGH-PERFORMANCE SOLVERS FOR DENSE HERMITIAN 

EIGENPROBLEMS* 



M. PETSCHOW 1 ", E. PEISEt, AND P. BIENTINESF 

Abstract. 

We introduce a new collection of solvers - subsequently called EleMRRR - for large-scale dense 
Hermitian eigenproblems. EleMRRR solves various types of problems: generalized, standard, and 
tridiagonal eigenproblems. Among these, the last is of particular importance as it is a solver on its own 
right, as well as the computational kernel for the first two; we present a fast and scalable tridiagonal 
solver based on the algorithm of Multiple Relatively Robust Representations - referred to as PMRRR. 
Like the other EleMRRR solvers, PMRRR is part of the freely available Elemental library, and is 
designed to fully support both message-passing (MPI) and multithreading parallelism (SMP). As 
a result, the solvers are equally effective in message-passing environments with and without SMP 
parallelism. We conducted a thorough performance study of EleMRRR and ScaLAPACK's solvers 
on two supercomputers. Such a study, performed with up to 8,192 cores, provides precise guidelines 
to assemble the fastest solver within the ScaLAPACK framework; it also indicates that EleMRRR 
outperforms even the fastest solvers built from ScaLAPACK's components. 
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1. Introduction. In this section we briefly state the considered eigenproblems, 
give a short description of our high-level approach, and list the main contributions of 
this paper. 

1.1. The Problem. A generalized Hermitiai^ eigenproblem (GHEP) is identi- 
fied by the equation 

Ax = XBx, (1.1) 

where A,B e C" x " are known matrices; the sought after scalars A and associated 
vectors x 6 C™, x ^ 0, are called eigenvalues and eigenvectors, respectively. We say 
that (A, x) is an eigenpair of the pencil (A, B). In the following, we make the additional 
assumption that either A or B is positive or negative definite, which implies that all 
the eigenvalues are real. Without loss of generality, we assume that B is positive 
definite. Thus, when referring to the GHEP of Eq. (|1.1[) . the restriction to Hermitian- 
definite pencils (A, B) is subsequently implied. If B is the identity matrix /, Eq. (jl.ljl 
reduces to the standard Hermitian eigenproblem (HEP) Ax = Ax; if A is also real 
and tridiagonal, the problem is referred to as a symmetric tridiagonal eigenproblem 
(STEP). 

Different numerical methods were devised to solve instances of the GHEP, HEP, 
and STEP in accordance to the amount of eigenpairs requested and additional prop- 
erties of the involved matrices (e.g., sparsity). In this paper, we concentrate on the 
efficient solution of dense generalized eigenproblems on modern distributed-memory 
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1 That is A = A* and B = B* . If both Hermitian matrices A and B are real valued, then the 
following discussion also holds with the words 'Hermitian' and 'unitary' respectively replaced by 
'symmetric' and 'orthogonal'. 
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platforms composed of shared-memory multiprocessor nodes; as a by-product, we also 
obtain results for standard and tridiagonal problems. 

1.2. Three Nested Eigensolvers. Common methods to the dense generalized 
eigenproblems make use of the following fact: given non-singular matrices G and F, 
the eigenvalues of the pencil {A, B) are invariant under the equivalence transformation 
(GAF,GBF); furthermore, x is an eigenvector of (A, B) if and only if F~ x x is an 
eigenvector of (GAF, GBF) [34]. 

The most versatile tool for the generalized cigcnproblcm, the QZ algorithm [33 , 
uses a sequence of unitary equivalence transformations to reduce the original pencil to 
generalized (real) Schur form. By design, the QZ algorithm is numerically backward 
stable and imposes no restrictions on the input matrices; unfortunately, the algorithm 
does not respect the symmetry of the Hermitian pencil (A, B) and is computationally 
rather costly. 

To preserve the symmetry of the problem while reducing the pencil (^4, B) to 
simpler form, methods are limited to sequences of congruence transformations - that 
is, using G = F* , where G and F are no longer required to be unitary. The traditional 
approach for computing all or a significant fraction of the eigenpairs of (A, B) relies on 
a reduction-backtransformation procedure, corresponding to three nested eigensolvers: 
generalized, standard, and tridiagonal. Since B is positive definite, the Cholesky 
factor can be used to transform the GHEP to a HEP [32], which in turn is reduced 
to a STEP; once the eigenvectors of the STEP are computed, they are mapped back 
to those of the original problem via two successive backtransformations. Overall, the 
process for solving a generalized eigenproblem - also known as the Cholesky- Wilkinson 
method - consists of six stages: 

1. Cholesky factorization: B is factored into LL* , where L is lower triangular. 

2. Reduction to standard form: From B = LL*, the original pencil (A,B) is 
transformed to (L~ 1 AL~* 1 I) and Eq. takes the form of a standard 
Hermitian eigenproblem My = Xy. The eigenvectors x of the GHEP and the 
eigenvectors of the HEP are related by y = L*x. 

3. Reduction to tridiagonal form: A unitary matrix Q is computed such that 
T = Q*MQ is real symmetric tridiagonal. The pencil (M,I) is transformed 
to (Q*MQ, I), corresponding to the tridiagonal eigenproblem Tz = \z, where 
z = Q*y. 

4. Solution of the tridiagonal eigenproblem: A set of eigenpairs (A, z) is com- 
puted such that Tz — Xz; the existence of n such eigenpairs (A, z) for which 
the set of eigenvectors forms an orthonormal basis for C™ is ensured by the 
Spectral Theorem [34]. 

5. First backtrans formation: In accordance to Stage 3, the eigenvectors of the 
standard eigenproblem are obtained by computing y = Qz. 

6. Second backtransformation: In accordance to Stage 2, the eigenvectors of the 
original pencil are obtained by computing x — L~*y. 

The above discussion shows that when the k computed eigenvalues are the entries of 
a diagonal matrix A S R fcxfe , and the associated eigenvectors are the columns of a 
matrix X £ C" xfc , then X* AX = A, X*BX = I, and AX = BXA. 

With slight modifications of Stages 2 and 6, the same six-stage procedure also 
applies to eigenproblems in the form ABx = Xx and BAx — Xx. In the first case, 
the reduction and the backtransformation become M — L*AL and x = L~*y, respec- 
tively; in the second case, they become M = L* AL and x = Ly. 

The six-stage procedure should only be used if B is sufficiently well-conditioned 
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with respect to inversion. For a detailed discussion of the properties of the GHEP, 
especially regarding perturbation theory and the problems arising for ill-conditioned 
B, we refer to the standard literature, including [TT1 IT91 IMl 145] . 

1.3. The Impact and Contributions of this Paper. The solution of the 
aforementioned types of large-scale eigenproblems is integral to a number of scientific 
disciplines, particularly vibration analysis [I] and quantum chemistry [2"2l [571 [Ml PIS] . 
An example of an application where the solution of the eigenproblem is the most time 
consuming operation is Density Functional Theory [5^1 HH HE]. There, as part of 
a simulation, one has to solve a set of equations in a self-consistent fashion; this is 
accomplished by an iterative process in which each iteration involves the solution of 
dozens or even hundreds of generalized eigenproblems. In many cases, the application 
requires 5-25% of the eigenpairs associated with the smallest eigenvalues, and the size 
of such problems is usually in the tens of thousands of degrees of freedom. In other 
applications, in which the simulations do not follow an iterative process, it is instead 
common to encounter only one single eigenproblem - potentially of very large size 
(50k-100k or more) [27, 29 . In both scenarios, the problem size is not limited by the 
physics of the problem, but only by memory requirements and time-to-solution. 

When the execution time and/or the memory requirement of a simulation become 
limiting factors, scientists place their hopes on massively parallel supercomputers. 
With respect to execution time, the use of more processors would ideally result in 
faster solutions. When memory is the limiting factor, additional resources from large 
distributed-memory environments should enable the solution of larger problems. We 
study the performance of eigensolvers for both situations: increasing the number of 
processors while keeping the problem size constant (strong scaling), and increasing 
the number of processors while keeping the memory per node constant (weak scaling). 

In this paper we make the following contributions: 

• Given the nested nature of the generalized, standard and tridiagonal eigen- 
solvers, the last is both a solver in its own right and the computational 
kernel for the HEP and the GHEP. We present a novel tridiagonal solver, 
PMRRR, based on the algorithm of Multiple Relatively Robust Represen- 
tations (MRRR) [13j [14], which merges the distributed and multithreaded 
approaches first introduced in [S] and [35] H PMRRR is well suited for both 
single node and large scale massively parallel computations. Experimental 
results indicate that PMRRR is currently the fastest tridiagonal solver avail- 
able, outperforming all the solvers included in LAPACK [2 , ScaLAPACK [7J 
and Intel's Math Kernel Library (MKL). 

• We introduce EleMRRR (from Elemental and PMRRR) , a set of distributed- 
memory eigensolvers for generalized and standard Hermitian eigenproblems. 
EleMRRR provides full support for hybrid message-passing and multithread- 
ing parallelism. If multithreading is not desired, EleMRRR can be used in 
purely message-passing mode. 

• The five stages of reduction and backtransformation in EleMRRR are based 
on Elemental, a library for the development of distributed-memory dense 
linear algebra routines 39 . Elemental embraces a two-dimensional cyclic 
element-wise matrix distribution, and attains performance comparable or 
even superior to the well established ScaLAPACK, PeIGS and PLAPACK 
parallel libraries [7j ?,[49]. For the reduction to standard form, an algorith- 



2 PMRRR should not be confused with the distributed-memory solver introduced in \E\. 
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mic variant is used which delivers high performance and scalability [40 . 
• A thorough performance study on two high-end computing platforms is pro- 
vided. This study accomplishes two objectives. On the one hand it con- 
tributes guidelines on how to build - within the ScaLAPACK framework - 
an eigensolver faster than the existing ones. This is of particular interest to 
computational scientists and engineers as each of the commonly use rou- 
tines (PZHEGVX, PDSYGVX, PZHEEVD, PDSYEVD) present performance penalties 
that can be avoided by calling a different sequence of subroutines and choos- 
ing suitable settings. On the other hand, the study indicates that ElcMRRR 
is scalable - both strongly and weakly - to a large number of processors, and 
outperforms the standard ScaLAPACK solvers. 
The paper is organized as follows: In Section [2] we discuss related work and give 
experimental evidence that some widely used routines fail to deliver the desired perfor- 
mance. In Section [3l we concentrate on EleMRRR, with emphasis on the tridiagonal 
stage - PMRRR. We present a thorough performance study on two state-of-the-art 
high-performance computer systems in Section^ We show that ScaLAPACK contains 
a set of fast routines that can be combined to avoid the aforementioned performance 
problems, and we compare the resulting routines to our solver EleMRRR. We sum- 
marize our findings in Section [SJ 

2. A Study of Existing Solvers. In this section we give a brief overview of 
existing methods and study well-known issues of widely used routines available in the 
current versiorQ of the ScaLAPACK library. We discuss the generalized, standard 
and symmetric tridiagonal eigenproblems in succession. 

2.1. The Generalized Eigenproblem. In some cases, even if A and B do 
not satisfy the assumptions of Section 11.11 the problem can still be transformed into 
one that exhibits the desired properties [11] . In general, if A and B are dense but 
non-Hermitian or if B has poor conditioning with respect to inversio instead of 
the aforementioned six-stage approach, the QZ algorithm can be used. A parallel 
distributed- memory implementation is discussed in pQ. 

The ScaLAPACK library contains routines for the three classes of eigenproblems 
we consider in this paper. A complete list of routine names for both the solvers and 
the individual stages is given in Table 12.11 In particular, PZHEGVX and PDSYGVX are 
ScaLAPACK 's double precision drivers for complex Hermitian and real symmetric 
generalized eigenproblems, respectively. 

In Fig. 12.11 we report the weak scalability of PZHEGVX for computing 15% of the 
eigenpairs of Ax — XBx associated with the smallest eigenvalues^ The left graph 
indicates that, as the problem size and the number of processors increase, PZHEGVX 
does not scale as well as the EleMRRR solver presented in Section |3j In the right 

3 See for example [46l ITOl l26l US] . 

4 At the time of writing, ScaLAPACK's latest version was 1.8. The version, 2.0, presents no 
significant changes in the tested routines. 

5 If B is (nearly) semi-definite, instead of the QZ algorithm, one can use a variant of the reduction 
to standard form introduced by [?]. However, in contrast to the QZ algorithm, this variant is not 
included in any of the most widely used libraries. 

6 The timings were generated on the Juropa supercomputer; a detailed description of the archi- 
tecture and the experimental setup is provided in Section [4] We used all default parameters. In 
particular the parameter orfac that indicates which eigenvectors should be re-orthogonalized during 
Inverse Iteration has the default value 10 — 3 . In practice, it is possible to use a less restrictive choice 
to reduce the execution time to some extent. In order to exploit ScaLAPACK's fastest reduction 
routines, the lower triangular part of the matrices is stored and referenced. 
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Stage 


Complex 


Real 


Description 


1 — 6 


PZHEGVX 


PDSYGVX 


/— 1 T TTT'T"* J~> " J. - J T Tj. J_ " 

GHEP — Bisection and Inverse Iteration 


3-5 


PZHEEVX 


PDSYEVX 


HEP — Bisection and inverse Iteration 


3-5 


PZHEEV 


PDSYEV 


HEP — QR algorithm 


3-5 


PZHEEVD 


PDSYEVD 


HEP — Divide and Conquer algorithm 


3-5 


PZHEEVR 


PDSYEVR 


HEP — MRRR algorithm 


1 




PDSTEDC 


STEP — Divide and Conquer algorithm 


1 




PDSTEGR 


S 1 EP — MRRR algorithm 


1 


PZPOTRF 


PDPOTRF 


Cholesky factorization 


2 


PZHEGST 


PDSYGST 


Reduction to HEP 


2 


PZHENGST 


PDSYNGST 


Reduction to HEP (square grid of processes) 


3 


PZHETRD 


PDSYTRD 


Reduction to STEP 


3 


PZHENTRD 


PDSYNTRD 


Reduction to STEP (uses PxxxTTRD for square grid of processes) 


4 




PDSTEBZ 


Eigenvalues of a tridiagonal matrix using Bisection 


1 


PZSTEIN 


PDSTEIN 


Eigenvectors of a tridiagonal matrix using Inverse Iteration 


5 


PZUNMTR 


PDORMTR 


Backtransformation to HEP 


6 


PZTRSM 


PDTRSM 


Backtransformation to GHEP (Type #1 or #2) 


6 


PZTRMM 


PDTRMM 


Backtransformation to GHEP (Type #3) 



Table 2.1 

List of relevant ScaLAPACK routine names. 



graph we show the breakdown of the execution time for each of the six stages. 

Independently of the input data, all the considered solvers perform a similar 
number of floating point operations in each of the Stages 1, 2, 3, 5, and 6; on the 
contrary, the complexity of Stage 4 depends on the input data, and varies from one 
method to another. With the exception of this stage, a comparison purely based on 
operation counts would be misleading; differences in execution time are mainly due 
to different use of the memory hierarchy and exploitation of parallelism. Throughout 
the paper, we rely on the execution time as the performance metric. 

In Fig. ETH it is evident that the routines PDSTEBZ and PZSTEIN, which implement 
the Bisection and Inverse Iteration (BI) tridiagonal eigensolver, are the main cause 
for the poor performance of PZHEGVX. For the problem of size 20,000, these routines 
are responsible for almost 90% of the compute time. BI's poor performance is a 
well understood phenomenon, e.g. [8], directly related to the effort necessary to re- 
orthogonalize eigenvectors corresponding to clustered eigenvalues. This issue led to 
the development of an improved version of Inverse Iteration, the MRRR algorithm, 
that avoids re-orthogonalization even when the eigenvalues are clustered. In addition 
to the performance issue, PZHEGVX also suffers from memory imbalances, as all the 
eigenvalues belonging to a cluster are computed on a single processor. 

Guideline 1. In light of the above considerations, the use of ScaLAPACK 's 
routines based on Bisection and Inverse Iteration (BI) is not recommended. 

We do not provide further comparisons between ElcMRRR and PZHEGVX or 
PDSYGVX. Instead, in Section [4] we illustrate how the performance of these drivers 
changes when the BI algorithm for the tridiagonal eigensolver is replaced with other - 
faster - methods available in ScaLAPACK, namely the Divide and Conquer algorithm 
(DC) and the MRRR algorithm [47l|50]. 

2.2. The Standard Eigenproblem. We restrict the discussion to dense Her- 
mitian problems. Such problems arise in a variety of applications, ranging from elec- 
trodynamics to macro-economics [UJ; they are also often solved as part of a GHEP, 
as discussed in Section [TJ 
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Fig. 2.1. Weak scalability for the computation of 15% of the eigenpairs associated with 
the smallest eigenvalues. As done commonly in practice \46^ , the eigenvectors are requested to be 
numerically orthogonal. Left: Total execution time of PZHEGVX and EleMRRR. Right: Fraction of 
the execution time spent in the six stages of PZHEGVX, from bottom to top (see Table lKTl l. 



The standard approach to solve a standard eigenproblem consists of three stages, 
corresponding to Stages 3-5 in the solution of the generalized problem: (a) Reduction 
to a real symmetric tridiagonal form; (b) Solution of the tridiagonal eigenproblem; 
(c) Backtransformation of the eigenvectors. A detailed analysis of the three-stage 
approach, concentrating on the first and third stage, can be found in |44j . 

An alternative approach is Successive Band Reduction (SBR) [6j. The idea is to 
split the reduction to tridiagonal form in two (or more) stages. In all but the last stage, 
the matrix is reduced to banded form with strictly decreasing bandwidths. Unlike the 
direct reduction to tridiagonal form, these stages can take full advantage of highly 
efficient kernels from the Basic Linear Algebra Subprograms (BLAS) library [17], thus 
attaining high-performance. The reduction is then completed with a final band-to- 
tridiagonal reduction. The downside of such a strategy lies in the accumulation of 
the orthogonal transforms. Thus, when a significant portion of the eigenvectors is re- 
quested, the SBR routines are not competitive. For this reason, SBR is normally used 
for computing only the eigenvalues or a small fraction of the eigenvectors. However, a 
recent publication suggests that, on highly parallel systems, SBR might be faster than 
direct reduction to tridiagonal form even when a significant fraction of eigenvectors 
is computed [3]. 

ScaLAPACK offers a number of routines for the standard Hermitian eigenprob- 
lem, each of which differs in its algorithm of choice for the tridiagonal eigenvalue prob- 
lem. The four routines PZHEEVX, PZHEEV, PZHEEVD, and the recently added PZHEEVR, 
implement BI [5l], the QR algorithm [2S [H], the DC algorithm [20], and the 
MRRR algorithm [H], respectively^ 

Among the solvers presently available in ScaLAPACK, only PZHEEVX (BI) and 
PZHEEVR (MRRR) offer the possibility of computing a subset of eigenpairs. PZHEEVX 
is widely used, even though, as highlighted in the previous section, it is highly non- 
scalable. Similarly, if eigenvectors are computed, the QR algorithm is known to be 
slower than DC for large problems [5] and thus the use of ScaLAPACK's routines 



7 Each of these routines has the usual counterpart for the real symmetric case: PDSYEVX, PDSYEV, 
PDSYEVD, and PDSYEVR. 
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based on QR is not recommended; in the future experiments, we omit comparisons 
with routines that are based on the QR algorithm or BI. 

We now focus on the weak scalability of the widely used routine, PZHEEVD, which 
uses DC for the tridiagonal eigenproblem. In Fig. 12.21 we show the results for PZHEEVD 
and ElcMRRR from an experiment similar to that of Fig. 12.11 Note that all eigenpairs 
were computed, since PZHEEVD docs not allow for subset computation. In the previous 
section, the example indicated that BI might dominate the runtime of the entire dense 
eigenproblem, while the DC method required less than 10% of the total execution time. 
Instead, as the matrix size increases, the reduction to tridiagonal form (PZHETRD) 
becomes the computational bottleneck, requiring up to 70% of the total time for the 
largest problem shown. A comparison of the left and right sides of Fig. 12.21 reveals 
that, for large problems, using PZHETRD for the reduction to tridiagonal form requires 
more time than the complete solution with EleMRRR. 

ScaLAPACK also includes PZHENTRD, a routine for the reduction to STEP es- 
pecially optimized for square processor grids. The performance improvement with 
respect to PZHETRD can be so dramatic that, for this stage, it is preferable to limit the 
computation to a square number of processors and redistribute the matrix accord- 
ingly 23 . It is important to note that the performance benefit of PZHENTRD can only 
be exploited if the lower triangle of the input matrix is stored, otherwise the slower 
routine, PZHETRD, is invoked @ 




Fig. 2.2. Weak scalability for the computation of all eigenpairs using DC. Left: Total execution 
time of PZHEEVD and EleMRRR. Right: Execution time for ScaLAPACK's routines PZHETRD and 
PZHENTRD, which are responsible for the reduction to tridiagonal form. The former, used within 
the routine PZHEEVD, causes a performance penalty and accounts for much of the time difference 
compared with EleMRRR. 



Guideline 2. ScaLAPACK's reduction routines fPxxxNGST and PxxxNTRD,) op- 
timized for square grids of processors are to be preferred over the regular reduction 
routines, even when non-square process grids are used; moreover, only the lower tri- 
angle of implicitly Hermitian matrices should be referenced. 

For performance and scalability reasons, in Section[4]we use the routines PxxxNTRD 
and PxxxNGST to build the fastest solver within the ScaLAPACK framework. 



"Similar considerations also apply for the reduction to standard form via the routines PZHEGST 
and PZHENGST, see [40]. 
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2.3. The Tridiagonal Eigenproblem. At the core of the reduction-backtrans- 
formation approach for the GHEP and the HEP is the symmetric tridiagonal eigen- 
problem. One of the main differences - and often the only difference - among solvers 
for generalized and standard problems lies in the method for this stage. As seen 
in Section 12.11 this might account for a significant computational portion of the en- 
tire solution process. In Section l2~2l we mentioned four methods: BI, QR, DC, and 
MRRR, and justified not providing experimental comparisons with the BI and QR 
approaches. 

As already mentioned, in contrast to all other stages of generalized and standard 
problems, the number of arithmetic operations of the tridiagonal eigensolver depends 
on the input data. In fact, depending on the matrix entries, either DC or MRRR 
may be faster. Fig. 12.31 provides an example of how performance is influenced by the 
input data. The algorithms are compared on two types of test matrices: "1-2-1" and 
"Wilkinson" . The former contains ones on the subdiagonals and twos on the diagonal; 
its eigenpairs are known analytically. In the latter, the subdiagonals contain ones and 
the diagonal equals the vector (m, m — 1, . . . , 1, 0, 1, . . . , m), with m = (n— l)/2. Due 
to the phenomenon of deflation, this matrix is known to favor the DC algorithm [9]. 
We also include timings for our solver, PMRRR; for both matrix types it eventually 
becomes the fastest solver. A detailed discussion of PMRRR follows in the next 
section. 




Fig. 2.3. Weak scalability for the computation of all eigenpairs of two different test matrix 
types. The left and right graphs have different scales. Left: "1-2-1" matrix; Right: "Wilkinson" 
matrix. In contrast to the results reported in \50h where a similar experiment comparing PDSTEDC 
and PDSTEGR is performed, even when the matrices offer an opportunity for heavy deflation, our 
PMRRR becomes faster than DC eventually due to its superior scalability. The scalability advantage 
of PMRRR compared with PDSTEGR is the result of non-blocking communications used in conjunction 
with a task-based algorithmic design that allows processes to continue computing while waiting for 
data. As an example, in the Wilkinson experiment with 1024 cores, ScaLAPACK's MRRR spends 
about 30 out of 50 seconds in exposed communication. 



We now provide a short comparison of PMRRR with ScaLAPACK's routines 
PDSTEDC and PDSTEGR, which implement the tridiagonal DC and MRRR algorithme@, 
respectively [47] [50]. 

In the worst case, PDSTEDC computes all eigenpairs at the cost of 0(n 3 ) floating 



9 PDSTEGR is not part of ScaLAPACK; it corresponds to the tridiagonal eigensolver used in PxxxEVR 
and is not encapsulated in its own routine. 
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point operations (flops); in practice, the flop count is lower due to deflation and the 
run time behavior is empirically n 2 5 [121 147] . Furthermore, most of the computation 
is cast in terms of fast BLAS-3 kernels [H] . 

While PDSTEDC cannot compute a subset of k < n eigenpairs, the MRRR routine 
PDSTEGR returns the eigenpairs at the reduced cost of 0(nk) flops. 

Guideline 3. Both of ScaLAPACK's tridiagonal eigensolver routines, PDSTEDC 
(DC) and PDSTEGR (MRRR), are generally fast and reasonably scalable; depending on 
the target architecture and specific requirements from the applications, either one may 
be used. Specifically, if only a small subset of the spectrum has to be computed, in 
terms of performance, the MRRR-based solvers are to be preferred to DC. 

In terms of accuracy, all tridiagonal solvers generally obtain accurate results in the 
sense that they achieve small residual norms and numerical orthogonality: Specifically, 
with machine precision e, 



\Tzj - XjZjW = 0(ne\\T\\) , and 

\z 3 z t 



0(ne), i^j, 



(2.1) 
(2.2) 



for all computed eigenpairs (\j,Zj) with ||2j|| 2 w 1. It is well-known that MRRR 
yields slightly worse coefficients for Eq. (|2.2|) than QR and DC [12] , In the rare 
cases where this is an issue, feasible alternatives are the DC algorithm and a mixed 
precision variant of MRRR [?]. On the downside, DC requires 0(n 2 ) extra memory; 
see [12l [50l ?] for further comparisons. 

In Section IH we include experimental data on generalized eigenproblems for both 
DC and MRRR. One of the challenges in building a scalable solver is that every 
stage must be scalable. This is illustrated in Fig. 12.41 which shows the results for 
ScaLAPACK's DC from the experiment detailed in Section 14.1.1 1 While the reduction 
to tridiagonal form and the tridiagonal eigensolver are respectively the most and the 
least expensive stages on 64 cores, on 2048 cores the situation is reversed. This 
behavior is explained by the parallel efficiencies shown in the right panel of Fig. [ 
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Fig. 2.4. Scalability of the computation of all eigenpairs using ScaLAPACK's DC. The details 
of the experiment are discussed in Section \4-.l.l\ 



3. Elemental's Generalized and Standard Eigensolvers. Elemental is a 
framework for dense matrix computations on distributed-memory architectures [39] ■ 
The main objective of the project is to ease the process of implementing matrix 
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operations without conceding performance or scalability. The code resembles a high- 
level description of the algorithm, and the details relative to data distribution and 
communication are hidden. Performance is achieved through a two-dimensional cyclic 
elemental (entry-wise) matrix distribution^ 

Elemental supports many frequently encountered dense matrix operations; these 
include the Cholesky, QR and LU factorizations, as well as the solution of linear 
systems. Recently, routines for Hcrmitian and symmetric eigenvalue problems have 
also been added. In particular, Elemental supports Hermitian-dcfinite generalized, 
standard Hcrmitian, as well as skew-Hermitian eigenproblems. All the eigensolvers 
follow the classical reduction and backtransformation approach described in Section[TJ 
Detailed discussions the reduction and backtransformation stages, both in general and 
within the Elemental environment, can be found in [331 |3H l2"51 HP] . 

In terms of accuracy, Elemental's solvers are equivalent to their sequential coun- 
terparts. We therefore do not report accuracy results and refer to [T2] instead. 

In terms of memory, Elemental's solvers are quite efficient. In Table [5TT1 we report 
approximate total memory requirements for computing k eigenpairs of generalized 
and standard eigenproblems. As a reference, we provide the same numbers for the 
solvers built from ScaLAPACK routines. In Section [31 we define the meaning of 
"ScaLAPACK DC" and "ScaLAPACK MRRR" precisely. The numbers in the table 
are expressed in units of the size of a single complex and real floating point number, 
depending on the required arithmetic. The memory requirement per process can be 
obtained by dividing by the total number of processes. 





Complex 


Real 






GHEP HEP 


GHEP 


HEP 


ScaLAPACK DC 


4n 2 3n 2 


5n 2 


An 2 


ScaLAPACK MRRR 


2n 2 + l.bnk n 2 + l.bnk 


2n 2 + 2nk 


n 2 + 2nk 


Elemental 


2n 2 + nk n 2 + nk 


2n 2 + nk 


n 2 + nk 


Table 3.1 



Approximate total memory requirements in units of complex and real floating point numbers for 
the computation of k eigenpairs. We assume that square process grids are used, as it is recommended 
for optimal performance and if memory usage is a concern. The numbers also hold when a non- 
square grid of processes is used in combination with the standard reduction routines. 

For square process grids, Elemental requires between 0.5n 2 to 2n 2 floating point 
numbers less memory than ScaLAPACK based solvers. If non-square grids are used, 
a user concerned about memory usage can make use of the non-square reduction 
routines - at the cost of sub-optimal performance. The reduction routines for square 
process grids would otherwise need a data redistribution that adds a n 2 term to 
Elemental and ScaLAPACK MRRR, but not to DC. On the other hand, in this 
situation, ScaLAPACK MRRR can save work space to perform its redistribution of 
the eigenvectors from a Id to a 2d block-cyclic data layout, reducing its terms by nk 
real floating point numbers. This eigenvector redistribution is performed in Elemental 
in-place, resulting in a smaller memory footprint than possible for ScaLAPACK's 
routines. 

For the solution of symmetric tridiagonal eigenproblems, Elemental incorporates 
PMRRR, a new parallel variant of the MRRR algorithm. PMRRR combines two 



ScaLAPACK instead deploys a block-cyclic matrix distribution. 
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solvers: one based on message-passing parallelism [SJ[5U], and another based on mul- 
tithreading [35]. As a consequence, the solver provides the user with multiple parallel 
programming models (multithreading, message-passing, a hybrid of both), so that 
the parallelism offered by modern distributed-memory architectures can be fully ex- 
ploited. In the following two subsections we focus the attention on PMRRR, our 
tridiagonal eigensolver. 

3.1. PMRRR as the Core of ElementaPs Eigensolver. The MRRR al- 
gorithm is a form of inverse iteration. Its salient feature is that the costly re- 
orthogonalization, necessary when the eigenvalues are clustered together, is entirely 
removed. As the experiment described in Fig. 12.11 shows, the re-orthogonalization 
may affect the execution time dramatically: in computing 15% of the eigenpairs as- 
sociated with the smallest eigenvalues of a matrix of size 20,000 using 512 cores, 
ScaLAPACK's Inverse Iteration takes about 404 seconds, while PMRRR requires less 
than 0.3 seconds. 

To achieve this performance, the classical inverse iteration was modified signifi- 
cantly. The basic procedure requires the repeated solution of linear systems: given an 
approximate eigenvalue Xj and a starting vector z^ , under mild assumption J^l the 
iterative process 

(T-V)-f +1) = sW *j° ' ( 3J ) 

with appropriate scaling factors s'*', results in an eigenvector approximation Zj that 
fulfills Eq. (|2.ip . Nevertheless, small residual norms do not guarantee orthogonality 
between independently computed eigenvectors - that is Eq. (I2.2[) might not hold [23] . 

The close connection between inverse iteration and the MRRR algorithm is dis- 
cussed in a number of articles [301 116] . Here we only mention three differences: (1) 
Instead of representing tridiagonal matrices by their diagonal and subdiagonal entries, 
MRRR uses Relatively Robust Representations (RRRs); these are representations with 
the property that small relative perturbations to the data result in small relative per- 
turbations to the eigenvalues [3H]; (2) the selection of a nearly optimal right-hand 
side vector for a one-step inverse iteration [35]; (3) the use of so-called twisted 
factorizations to solve linear systems |35j . 

In the following, we briefly discuss both the mathematical foundations of the 
MRRR algorithm, and how parallelism is organized in PMRRR. 

3.1.1. The MRRR algorithm. At first, a representation RRRo is chosen so 
that it defines all the desired eigenpairs in a relatively robust way [TH For 
example, a factorization of the form LqDqL^ = T — crl is an RRR for all of the 
eigenvalues [36j : here, Dq is diagonal, Lq is lower unit bidiagonal, and a is a scalar such 
that T — o~I is definite. As a second step, bisection is used to compute approximations 
to each eigenvalue Xj. At the cost of 0(n) flops, bisection guarantees that each 
eigenvalue is computed with high relative accuracy |31) . 

At this point, for all the eigenvalues Xj that have larg^H relative gaps, i.e., 
relgap(Aj) = min,^ |Aj — Aj|/|Aj| > tol, it is possible to compute their corresponding 

1 In particular, \\j — Xj\ < max^j — Xj\ is required. 

12 Without loss of generality, we will assume that T is numerically irreducible: No off-diagonal 
element is smaller in magnitude than a certain threshold. In context of the dense eigenproblem, the 
threshold is usually set to e||T||. 

13 In practice, a relative gap greater than 10 — 3 is considered sufficiently large. 
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eigenvectors: By using one step of inverse iteration with a twisted factorization, the 
acute angle Z(zj,zj) between the computed eigenvector Zj and true eigenvector Zj 
satisfies 



The denominator in Eq. p. 21) is the reason for which the process can be applied only to 
well-separated eigenvalues - measured by its relative gap. Such well-separated eigen- 
values are called singletons. Indeed, for singletons the computed eigenvectors have a 
small angle to the true eigenvector and consequently are numerically orthogonal. 

In contrast, when a set of consecutive eigenvalues have small relative gaps - that 
is, they form a cluster - the current RRR cannot be used to obtain numerically 
orthogonal eigenvectors. Instead, one exploits the fact that the relative gap is not 
invariant to matrix shifts. The algorithm then proceeds by constructing a new RRR 
{L c , D c } in a mixed relatively stable way: L C D C L^ — LqDqLq — <j c I . By shifting, the 
relative gaps are modified by a factor | Xj |/ 1 Aj — cr c | ; thus o~ c is chosen close to one of the 
eigenvalues in the cluster, so that at least one of them becomes well-separated. Once 
the new RRR is established, the eigenvalues in the cluster are refined to high relative 
accuracy with respect to the new RRR and classified as singletons and clusters. If 
all eigenvalues in the cluster are singletons, then the representation {L c , D c } can be 
used to compute a set of orthogonal eigenvectors for a slightly perturbed invariant 
subspace of {L , D }. If instead some eigenvalues are still clustered, the procedure is 
repeated recursively until all desired eigenpairs are computed. 

As a detailed discussion of the MRRR algorithm is outside the scope of this article, 
for further information we refer the readers to [33 Q31 [HI HH [52] and the references 
therein. 

3.1.2. PMRRR's parallelism. Our parallelization strategy consists of two lay- 
ers: a global and a local one. At the global level, the k eigenvalues and eigenvectors 
are statically divided into equal parts and assigned to the processes. Since the unfold- 
ing of the algorithm depends on the spectrum, it is still possible that the workload is 
not perfectly balanced among the processes. At the local level (within each process), 
the computation is decomposed into tasks that can be executed in parallel by multiple 
threads. The processing of these tasks leads to the dynamic generation of new tasks 
and might involve communication with other processes. The newly generated tasks 
are then likewise enqueued. 

When executed with nproc processes, the algorithm starts by broadcasting the 
input matrix and by redundantly computing the initial representation RRRq. Once 
this is available, the computation of the approximations Xj of k eigenvalues of LoDqLq 
is embarrassingly parallel: Each process is responsible for at most epp = [k /nproc] 
eigenvalues!^ similarly, each of the nthread threads within a process has the task to 
compute at most ept — \eppj 'nthread] eigenvalues. The processes then gather all the 
eigenvalues, and the corresponding eigenpairs are assigned as desired. 

Locally, the calculation of the eigenvector^ is split into computational tasks of 
three types: a set of singletons, clusters that require no communication, and clusters 



14 PMRRR (ver. 0.6) computes approximations to all n eigenvalues at this stage, such that epp = 
\n/ nproc] . 

15 We only refer to the calculation of the eigenvectors here, although the eigenvalues are also 
modified in this part of the computation. 




(3.2) 
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that require communication with other processes. The computation associated with 
each of the three types is detailed below. 

1. A set of singletons. The corresponding eigenvectors are computed locally. No 
further communication among processes is necessary. 

2. A cluster requiring no communication. When the cluster contains eigenvalues 
assigned to only one process, no cooperation among processes is needed. The 
four necessary steps are the same as those for the cluster task in [38]: A 
new RRR is computed; the eigenvalues are refined to relative accuracy with 
respect to the new RRR; the refined eigenvalues are classified into singletons 
and clusters; the corresponding tasks are enqueued into the local work queue. 

3. A cluster requiring communication. When the cluster contains a set of eigen- 
values which spans multiple processes, inter-process communication is needed. 
In this case, all the processes involved perform the following steps: a new 
RRR is computed redundantly, the local set of eigenvalues is refined, and the 
eigenvalues of the cluster are gathered and reclassified. 

Multithreading support is easily obtained by having multiple threads dequeue and 
execute tasks. The tasks are dynamically generated: their number and size highly 
depends on the spectral distribution of the input matrix; for this reason, the execution 
time for matrices of the same size may differ noticeably. By contrast, the memory 
requirement is matrix independent (0(nk/nproc) floating point numbers per process), 
and perfect memory balance is achieved [5j [50] . 

Our tests show that the hybrid parallelization approach is equally or slightly faster 
than the one purely based on MPI. This is generally true for architectures with a high 
degree of inter-node parallelism and limited intra- node parallelism. By contrast, on 
architectures with a small degree of inter-node parallelism and high degree of intra- 
node parallelism, we expect the hybrid execution of PMRRR to be preferable to pure 
MPI. 

We stress that even when no multithreading is used, the task-based design of 
PMRRR can be advantageous: By scheduling tasks that require inter-process commu- 
nication with priority and using non-blocking communication, processes can continue 
executing tasks while waiting to receive data. This strategy often leads to a perfect 
overlap of communication and computation. 

4. Experiments. In the next two sections we present experimental results for 
the execution on two state-of-art supercomputers at the Research Center Julich, Ger- 
many: Juropa and Jugene. 

4.1. Juropa. Juropa consists of 2208 nodes, each comprising two Intel Xeon 
X5570 Nehalem quad-core processors running at 2.93 GHz with 24 GB of memory. 
The nodes are connected by an Infiniband QDR network with a Fat-tree topology. 

All tested routines were compiled using the Intel compilers (ver. 11.1) with the flag 
-03 and linked to the ParTec's ParaStation MPI library (ver. 5.0.23)0 Generally, we 
used a two-dimensional process grid P r x P c (number of rows x number of columns) 
with P r = P c whenever possible, and P c = 2P r otherwisd^. If not stated otherwise, 
one process per core was employed. 

The ScaLAPACK library (ver. 1.8) was used in conjunction with Intel's MKL 



16 Version 5.0.24 was used when support for multithreading was needed. 

17 As discussed in Sections 12.21 and [3] P c ~ P r or the largest square grid possible should be 
preferred. These choices do not affect the qualitative behavior of our performance results. 
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BLAS (ver. 10.2)0 From extensive testing, we identified that in all cases the optimal 
block size was close to 32; therefore we carried out the ScaLAPACK experiments only 
with block size of 16, 32, 48; the best result out of this pool is then reported. 

Since no driver for the generalized eigenproblem that makes use of DC is avail- 
able, we refer to ScaLAPACK 's DC as the sequence of routines PZPOTRF-PZHENGST- 
PZHENTRD— PDSTEDC-PZUNMTR-PZTRSM, as listed in Table ED Similarly, we refer to 
ScaLAPACK's MRRR as the same sequence with PDSTEDC replaced by PDSTEGR0 

We stress that the slow routines PZHEGST and PZHETRD, for the reduction to stan- 
dard and tridiagonal form, respectively, were not used, and instead replaced by the 
faster PZHENGST and PZHENTRD. In order to make use of ScaLAPACK's fast reduc- 
tion routines, only the lower triangular part of the matrices is referenced and enough 
memory for a possible redistribution of the data is provided. 

Elemental (ver. 0.6) - incorporating PMRRR (ver. 0.6) - was used for the Ele- 
MRRR timings. In general, since Elemental does not tie the algorithmic block size to 
the distribution block size, different block sizes could be used for each of the stages. 
We do not exploit this fact in the reported timings. Instead the same block size is 
used for all stages. A block size of around 96 was in all cases optimal, therefore ex- 
periments were carried out for block sizes of 64, 96 and 128, but only the best timings 
are reportedF°l 

Since the timings of the tridiagonal eigenproblem depend on the input data, so 
does the overall solver. In order to compare fairly different solvers, we fixed the 
tridiagonal input matrix by using the 1-2-1 type. The performance of every other 
stage is data independent. Moreover, since the output of the tridiagonal solvers has to 
be in a format suitable for the backtransformation, the MRRR-based routines have to 
undergo a data redistribution; in all the experiments, the timings for Stage 4 include 
the cost of the redistribution. 

In the next two subsections we show results for both strong and weak scaling. In 
all experiments the number of nodes are increased from 8 to 256. Since each node 
consists of two quad-core processors, this corresponds to 64 to 2048 cores. 

4.1.1. Strong Scaling. We present timings of EleMRRR for the computation 
of all the eigenpairs of the generalized Hermitian eigenproblem Ax = XBx for fixed 
problem size, n = 20,000. The results are displayed in Fig. 14.11 As a reference, we 
have included timings for ScaLAPACK's solverso The right side of the figure shows 
the parallel efficiency - defined as e p = (t re f ■ p re f)/{t p ■ p) - where t p denotes the 
execution time on p processors, and t re f the execution time on p re f processors. The 
reference time used 8 nodes (64 cores). 

Once the proper sequence of routines is rectified, the performance of ScaLA- 
PACK's solvers is comparable to that of EleMRRR, up to 512 cores (see Fig. 14. ip . 
For 64 to 512 cores DC is about 10% to 40% slower than EleMRRR, while ScaLA- 
PACK's MRRR is about 7% to 20% slower. The advantage of EleMRRR mainly 



18 At the time of writing, ScaLAPACK's latest version was 1.8. The current version, 2.0, presents 
no significant changes in the tested routines. 

19 As PDSTEGR is not contained in ScaLAPACK, it corresponds to the sequence PZPOTRF-PZHENGST- 
PZHEEVR-PZTRSM. 

20 The block size for matrix vector products were fixed to 32 in all cases. For the biggest matrices 
in the weak scaling experiment only the block size of 32 and 96 were used for ScaLAPACK and 
EleMRRR, respectively. 

21 We did not investigate the cause for the increased run time of ScaLAPACK using 1024 and 
2048 cores. While most subroutines in the sequence are slower compared with the run time using 
512 cores, PZHENTRD scales well up to the tested 2048 cores - see also Fig. 12.21 
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Fig. 4.1. Strong scalability for the computation of all eigenpairs. Matrices A and B are of 
size 20,000. The dashed lines refer to ScaLAPACK's solvers when the matrices A and B are stored 
in the upper triangular part; in this scenario, the non-square routines for the reductions are used. 
Left: Total execution time in a log-log scale. Right: Parallel efficiency; normalized to the execution 
using 64 cores. 

comes from the Stages 1, 2 and 6, i.e., those related to the generalized eigenproblem. 
The timings for the standard problem (Stages 3-5) are nearly identical, with DC 
slightly slower than both MRRR-based solutions. 

The story for 1024 and 2048 cores changes; the performance of ScaLAPACK's rou- 
tines for the generalized eigenproblem drop dramatically. Compared to DC, EleMRRR 
is about 3.3 and 6.3 times faster; with respect to ScaLAPACK's MRRR instead, Ele- 
MRRR is 2.7 and 4.7 times faster. The same holds for the standard eigenproblem, 
where EleMRRR is about 2.9 and 6.2 times faster than DC and 1.9 and 3.7 times 
faster than MRRR. 

A study on the Juropa supercomputer suggests that the run time of applications 
can be greatly affected by settings of the underlying MPI implementation [33] . In par- 
ticular, the switch from static - the default setting - to dynamic memory allocation 
for MPI connections may result in improved performance. In regards to the perfor- 
mance degradation appearing in Fig. 14.11 (left), such a switch positively impacts some 
of the stages, including PDSTEDC; on the other hand it gravely worsens other stages, 
including PDSTEGR. Overall, ScaLAPACK's DC would only marginally improve, while 
MRRR's performance would greatly deteriorate. As a consequence, we employ the 
default MPI settings in all later experiments. 

In Fig. 14. 21 we take a closer look at the six different stages of EleMRRR. The left 
panel tells us that roughly one third of EleMRRR' s execution time - corresponding to 
Stages 4, 5 and 6 - is proportional to the fraction of computed eigenpairs. Computing 
a small fraction of eigenpairs would therefore require roughly about two thirds of 
computing the complete decomposition. On the right-hand panel of Fig. l4~2l we report 
the parallel efficiency for all six stages separately. When analyzed in conjunction with 
the left figure, this panel indicates if and when a routine becomes a bottleneck due to 
bad scaling. 

The tridiagonal eigensolver (Stage 4) obtains the highest parallel efficiency On 
the other hand, it contributes for less than 2.2% to the overall run time and is therefore 
negligible in all experiments. 

Up to 1024 cores, ScaLAPACK's MRRR shows a similar behavior: the tridiagonal 
stage makes up for less than 6% of the execution time. With 2048 cores instead, the 
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Fig. 4.2. EleMRRR's strong scalability for the computation of all eigenpairs. Similar results 
for ScaLAPACK's DC are shown in Fig. \2.4\ Left: Parallel efficiency for all six stages of the 
computation. Achieving a high parallel efficiency is especially important for the stages in which most 
part of the computation is spent. Right: Fraction of time spent in all six stages of the computation. 
The most time consuming stages are the reduction to tridiagonal form, the reduction to standard 
form, and the backtransformation to HEP, while the tridiagonal eigensolver is negligible. The time 
spent in the last three stages is proportional to the number of eigenpairs computed. 



percentage increases up to 21%. The situation is even more severe for DC, as the 
fraction spent in the tridiagonal stage increases from about 4.5% with 64 cores to 
41% with 2048 cores. This analysis suggests that the tridiagonal stage, unless as 
scalable as the other stages, will eventually account for a significant portion of the 
execution time. 

4.1.2. Weak Scaling. Fig. 14.31 illustrates EleMRRR's timings for the com- 
putation of all the eigenpairs of Ax = XBx. In this experiment, the matrix size 
increases (from 14,142 to 80,000) together with the number of cores (from 64 to 
2048). The right panel of the figure contains the parallel efficiency, defined as 

&p \Pref ' Pref ' ^ 

)/(t p ■ p ■ n re f), where all the quantities are like in the previ- 
ous section and n re f denotes the smallest matrix size in the experiment (14,142). 

In the tests using 512 cores and less, EleMRRR outperforms ScaLAPACK only 
by a small margin, while using 1024 cores and more, the difference becomes signif- 
icant. The right graph indicates that EleMRRR scales well to large problem sizes 
and high number of processes, with parallel efficiency close to one. Thanks to its 
better scalability, for the biggest problem EleMRRR is 2.1 and 2.5 times faster than 
ScaLAPACK's MRRR and DC, respectively. 

The execution time is broken down into stages in Fig. 14.41 (left). Four comments 
follow, (a) The time spent in PMRRR (Stage 4) is in the range of 2.5% to 0.7% and 
it is is completely negligible, especially for large problem sizeslfl (b) The timings 
corresponding to the standard eigenproblem (Stages 3-5) account for about 72% of 
the generalized problem's execution time, (c) The part of the solver whose execution 
time is roughly proportional to the fraction of desired eigenpairs (Stages 4-6) makes 
up 32%-37% of the execution for both the GHEP and HEP. (d) No one stage in 
EleMRRR is becoming a bottleneck, as all of them scale equally well. 

Fig. 14.41 (right) shows the execution of EleMRRR using one process per socket 
with four threads per process. The resulting execution time is roughly the same as 



The timings relative to only Stage 4 are detailed in Fig. 12.31 
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Fig. 4.3. Weak scalability for the computation of all eigenpairs. Matrices A and B are of 
varying size. The dashed lines refer to ScaLAPACK's solvers when the matrices A and B are stored 
in the upper triangular part; in this scenario, the non-square routines for the reductions are used. 
Left: Total execution time. Right: Parallel efficiency relative to 64 cores. All three routines are 
fast and efficient up to 512 cores. For higher numbers of nodes the efficiency of the ScaLAPACK 
routines drops, while EleMRRR retains almost perfect scalability. 




Fig. 4.4. EleMRRR's weak scalability for the computation of all eigenpairs. Left: Fraction of 
the execution time spent in the six stages, from bottom to top. Right: Comparison between a pure 
MPI execution and a hybrid execution using one process per socket with four threads per process. 



for the pure MPI execution, highlighting the potential of Elemental's hybrid mode. 
Similar experiments with a higher degree of multithreading would positively affect 
PMRRR, but not the reduction to tridiagonal form. 

4.2. Jugene. In this section, we verify the prior results obtained on Juropa for a 
different architecture - namely, the BlueGene/P installation Jugene. Jugene consists 
of 73,728 nodes, each of which is equipped with 2 GB of memory and a quad core 
PowerPC 450 processor running at 850 MHz. 

All routines were compiled using the IBM XL compilers (ver. 9.0) in combination 
with the vendor tuned IBM MPI library. We used a square processor grid P r = P c 
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whenever possible and P r = 2P C otherwise!^ Similarly, ScaLAPACK (ver. 1.8) in con- 
junction with the vendor-tuned BLAS included in the ESSL library (ver. 4.3) was used 
throughout^ In contrast to the Juropa experiments, we concentrate on the weak scal- 
ability of the symmetric- definite generalized eigenproblem. Therefore ScaLAPACK's 
DC timings correspond to the sequence of routines PDPOTRF-PDSYNGST— PDSYNTRD- 
PDSTEDC PDORMTR PDTRSM. Accordingly, ScaLAPACK's MRRR corresponds to the 
same sequence of routines with PDSTEDC replaced by PDSTEGR0 In both cases, a 
block size of 48 was found to be nearly optimal and used in all experiments. As 
already explained in Section 14.11 we avoided the use of the routines PDSYGST and 
PDSYTRD for the reduction to standard and tridiagonal form, respectively. 

For EleMRRR's timings we used Elemental (ver. 0.66), which integrates PM- 
RRR (ver. 0.6). A block size of 96 was identified as nearly optimal and used for all 
experiments! 26 ! 

4.2.1. Weak Scaling. In the left panel of Fig. 1431 we present EleMRRR's tim- 
ings for the computation of all eigenpairs of the generalized problem in the form of 
Ax — XBx. While the size of the test matrices ranges from 21,214 to 120,000, the 
number of nodes increases from 64 to 2,048 (256 to 8,192 cores). In the right panel, 
the execution time is broken down into the six stages of the generalized eigenproblem. 




Fig. 4.5. Weak scalability for computing all eigenpairs of Ax = XBx. The dashed lines refer 
to ScaLAPACK's solvers when the matrices A and B are stored in the upper triangular part; in this 
scenario, the non-square routines for the reductions are used. Left: Total execution time. EleMRRR 
is the fastest solver across the board. Right: Fraction of the execution time spent in the six stages, 
from bottom to top. 

Both graphs show a similar behavior to the experiments performed on Juropa. 
In all experiments EleMRRR outperforms both the ScaLAPACK's solvers. Most 
importantly, ScaLAPACK again suffers from a breakdown in scalability. 



23 As noted in Section l4.ll P c fa P r or the largest square grid possible should be preferred. These 
choices do not affect the qualitative behavior of our performance results. 

24 At the time of writing, ScaLAPACK's up-to-date version was 1.8. The current version, 2.0, 
presents no significant changes in the tested routines. 

25 As PDSTEGR is not contained in ScaLAPACK, it corresponds to the sequence PDPOTRF-PDSYNGST- 
PDSYEVR-PDTRSM. 

26 The block size for matrix vector products, which does not have a significant influence on the 
performance, was fixed to 64 in all cases. 
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When analyzing the results of Fig. 14.51 for ScaLAPACK's solver we the following 
observations, (a) The scalability issues can be attributed mainly to Stages 1, 2 and 4. 
In particular, for the largest experiment ScaLAPACK's reduction to standard form 
(28 minutes) and MRRR (25 minutes) each exceed the time that EleMRRR spends 
for the entire problem, (b) Although ScaLAPACK's tridiagonal eigensolver is usually 
not very time consuming, for highly parallel systems it might become the bottleneck. 
As an example, in our experiment on 8,192 cores the tridiagonal problem accounts for 
54% (MRRR) and 33% (DC) of the total execution time of the standard eigenproblem. 
(c) Due to better scalability, DC becomes faster than ScaLAPACK's MRRR. 

We conclude with four comments regarding EleMRRRs behavior, (a) While stage 
4 of ScaLAPACK's MRRR and DC take up to 28% and 20% of the total execution 
time, respectively, PMRRR accounts for less than 4%. In particular, for the largest 
problem 25 minutes were spent in ScaLAPACK's MRRR, whereas PMRRR required 
only 20 seconds. In all experiments PMRRR's execution time was negligible, (b) 
The timings corresponding to the standard eigenproblem account for 70%-74% of the 
generalized problem's execution time, (c) The part which is roughly proportional 
to the fraction of desired eigenpairs makes up 26%-32% of both the generalized and 
standard problem, (d) All the six stages scale equally well and no computational 
bottleneck emerges. 

5. Conclusions. Our study of dense large-scale generalized and standard eigen- 
problems was motivated by performance problems of commonly used routines in the 
ScaLAPACK library. In this paper we identify such problems and provide clear guide- 
lines on how to circumvent them: by invoking suitable routines with the right settings, 
the users can assemble solvers faster than those included in ScaLAPACK. 

The main contribution of the paper lies in the introduction of Elemental's dense 
eigensolvers and our tridiagonal eigensolver, PMRRR. Together, they provide a set of 
routines - labeled EleMRRR - for large-scale eigenproblems. These solvers are part of 
the publicly available Elemental library and make use of PMRRR, a parallel version 
of the MRRR algorithm for computing all or a subset of eigenpairs of tridiagonal 
matrices. PMRRR supports pure message-passing, pure multithreading, as well as 
hybrid executions, and our experiments indicate that it is among the fastest and most 
scalable tridiagonal eigensolvers currently available. 

In a thorough performance study on two state-of-the-art supercomputers, we com- 
pared EleMRRR with the solvers built within the ScaLAPACK framework according 
to our guidelines. For a modest amount of parallelism, ScaLAPACK's solvers obtain 
results comparable to EleMRRR, provided the fastest routines with suitable settings 
are invoked. In general, EleMRRR attains the best performance and obtains the best 
scalability of all solvers. 
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